Port CMU 15-445 to the bench by Jackcuii · Pull Request #97 · sys-intelligence/system-intelligence-benchmark

Jackcuii · 2026-01-27T22:37:59Z

This is a Draft PR

Description

This PR adds CMU 15-445 Lab 0 (Count-min Sketch) to the Benchmark Suite. The task requires implementing a thread-safe Count-min sketch data structure, a probabilistic data structure used for frequency estimation in streaming data. This lab focuses on C++ programming, concurrency, algorithms, and database systems concepts.
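For readers unfamiliar with the data structure, here is a minimal sketch of a count-min sketch with atomic counters. It is not the BusTub interface the lab asks for: the class name `SketchExample`, the `Insert` and `Count` method names, and the per-row hash mixing are all assumptions made for illustration, and atomic increments are only one possible way to make updates thread-safe.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical illustration only; not the class or API defined by the lab.
template <typename Key>
class SketchExample {
 public:
  SketchExample(std::size_t width, std::size_t depth)
      : width_(width), depth_(depth), cells_(width * depth) {
    assert(width > 0 && depth > 0);
    for (auto &cell : cells_) {
      cell.store(0, std::memory_order_relaxed);
    }
  }

  // Increment one counter per row. fetch_add is atomic, so concurrent
  // inserts from several threads do not lose updates.
  void Insert(const Key &key) {
    for (std::size_t row = 0; row < depth_; ++row) {
      cells_[row * width_ + Column(key, row)].fetch_add(1, std::memory_order_relaxed);
    }
  }

  // The estimate is the minimum over the rows. Collisions can only add to a
  // counter, so the estimate may be too high but is never too low.
  std::uint32_t Count(const Key &key) const {
    std::uint32_t best = std::numeric_limits<std::uint32_t>::max();
    for (std::size_t row = 0; row < depth_; ++row) {
      best = std::min(best, cells_[row * width_ + Column(key, row)].load(std::memory_order_relaxed));
    }
    return best;
  }

 private:
  // Derive a different hash per row by mixing the row index into std::hash
  // (an assumed scheme chosen for brevity).
  std::size_t Column(const Key &key, std::size_t row) const {
    std::uint64_t h = std::hash<Key>{}(key);
    h ^= (row + 1) * 0x9e3779b97f4a7c15ULL;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return static_cast<std::size_t>(h % width_);
  }

  std::size_t width_;
  std::size_t depth_;
  std::vector<std::atomic<std::uint32_t>> cells_;
};
```

If a single key is the only thing ever inserted, `Count` returns its exact frequency. Once other keys share counters with it, the estimate can only grow.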

Changes

Added new task directory data/cmu_15-445/task_cpp/ with complete lab setup

Testing

E2E Tested with Claude Haiku

TODOs

P1
P2
P3
P4



tareknaser

Thanks for the great work. This looks almost ready to merge. I made a few small updates including adding a course entry and a reference solution (based on Claude’s trajectory) and rebasing on top of main. I’ll add a couple more minor updates in separate comments for you to review.

If everything looks good, we can go ahead and merge

tareknaser · 2026-01-28T23:59:44Z

benchmarks/courselab_bench/data/cmu_15-445/task_cpp/preprocess.sh

Do you think we can simplify this file to be

#!/bin/bash
set -e
echo "=== Setting up CMU 15-445 CountMinSketch Lab ==="
cd /workspace
echo "Installing git"
apt-get update > /dev/null 2>&1
apt-get install -y git > /dev/null 2>&1
echo "Cloning bustub repository"
git clone https://github.com/cmu-db/bustub.git /tmp/bustub > /dev/null 2>&1
git -C /tmp/bustub checkout bd3912741c45370d5f9c7bef638452b10b140138 > /dev/null 2>&1
echo "Moving source to workspace"
mv /tmp/bustub/* ./
mv /tmp/bustub/.clang-format ./ 2>/dev/null || true
mv /tmp/bustub/.clang-tidy ./ 2>/dev/null || true
rm -rf /tmp/bustub .git
echo "Installing build dependencies"
build_support/packages.sh -y > /dev/null 2>&1
echo "Creating checksums for protected files"
mkdir -p /tmp/checksums
sha256sum test/primer/count_min_sketch_test.cpp > /tmp/checksums/test.sha256
echo "Building project"
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug .. > /dev/null 2>&1
make -j$(nproc) > /dev/null 2>&1
echo "Setup complete"
echo "Agent should implement:"
echo " - src/include/primer/count_min_sketch.h"
echo " - src/primer/count_min_sketch.cpp"

tareknaser · 2026-01-29T00:00:41Z

benchmarks/courselab_bench/data/cmu_15-445/task_cpp/evaluate.sh

And the evaluation script to be

#!/bin/bash
set -e
cd /workspace
# Verify test file wasn't modified
echo "Verifying protected files were not modified"
if ! sha256sum -c /tmp/checksums/test.sha256 > /dev/null 2>&1; then
  echo "FAIL: test/primer/count_min_sketch_test.cpp was modified"
  exit 1
fi
echo "Protected files unchanged"
# Build
echo ""
echo "=== Building ==="
rm -rf build
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug .. > /dev/null 2>&1
if ! make -j$(nproc); then
  echo "FAIL: Build failed"
  exit 1
fi
# Run tests
echo ""
echo "=== Running Tests ==="
make -j$(nproc) count_min_sketch_test > /dev/null 2>&1
if ! ./test/count_min_sketch_test; then
  echo "FAIL: Tests failed"
  exit 1
fi
# Format check
echo ""
echo "=== Format Check ==="
make format > /dev/null 2>&1
if ! make check-clang-tidy-p0; then
  echo "FAIL: clang-tidy check failed"
  exit 1
fi
echo ""
echo "PASS: All checks passed"
exit 0

There is no need for a scoring scheme since we just report pass/fail. What do you think?

Jackcuii · 2026-01-30T03:16:30Z

Thank you Tarek! I will add more tests to this PR to scale it up~

xuafeng · 2026-02-02T18:33:34Z

@Jackcuii For tests, do you mean that we can have more? Can you please add more tests as soon as possible? We need to merge this PR. Thanks a lot.

Jackcuii · 2026-02-02T19:08:44Z

Hi Xuan!

Yes, we can have more! Sorry for the delay; I am traveling home these days. I will push hard once I arrive home on the 4th 😃.

I may need to change the test workflow to a 'consecutive test', meaning the remaining four tests are run back to back.

That is because Labs 2, 3, and 4 of 15-445 each build on the previous lab, and we do not have a golden version of the project. So the agent needs to work through the four labs consecutively in one go.
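The 'consecutive test' idea could look roughly like the sketch below: labs run in order and evaluation stops at the first failure, because each later lab assumes the earlier one works. `LabStage` and `RunConsecutively` are hypothetical names, not part of the benchmark's code, and each stage's check is a placeholder callable standing in for whatever the real evaluation would run.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stage description: a lab name plus a check that returns
// true when that lab's tests pass.
struct LabStage {
  std::string name;
  std::function<bool()> run;
};

// Runs the stages in order and stops at the first failure, since later
// labs build on earlier ones. Returns how many stages passed.
std::size_t RunConsecutively(const std::vector<LabStage> &stages) {
  std::size_t passed = 0;
  for (const auto &stage : stages) {
    assert(static_cast<bool>(stage.run));  // every stage needs a check
    if (!stage.run()) {
      break;
    }
    ++passed;
  }
  return passed;
}
```

Returning the count of passed stages rather than a single boolean leaves room for reporting how far the agent got, though the pass/fail reporting discussed earlier would only need to compare it against the total.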

xuafeng and others added 30 commits November 5, 2025 18:10

Rename it "System Intelligence Benchmark"

6d24e69

Init: Initialize SysMoBench benchmark integration

87db1e9

feat: Add gitigore

69f4cb5

feat: Add prototype for phase 1&2

843f031

feat: Distinguish evaluator and model API keys in env.toml

0d2b38f

feat: Add validation for required evaluator API keys

b2acaa7

doc: update README.md

ca7e72e

initial ArtEval commit

ec7b57f

Merge pull request #2 from systemintelligence/feat/distinguish-api-keys

a607e73

Distinguish the models used in the executor and evaluator

feat: Add test

60d30e0

featr: Add install.sh

ff96313

adding overview and contributor's guide

1799370

skeleton ArtEval agent implementation

2054314

adding sosp24 wasabi

6303aa5

docs: add arteval to main README

a5358dc

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(ci): add GH Actions workflow for running benchmarks tests

904374e

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat: add issue and pull request templates

40ccf1f

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix(ci): add a test for example_bench

4130c7a

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix: shell scripts to be executable

3af5b70

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs: update README with instructions for running a single benchmark

156c77c

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs: a note on docker image arch support

a0557f9

Signed-off-by: Tarek <tareknaser360@gmail.com>

meta: add outputs directories to gitignore

868da59

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(ci): add release trigger to workflow

ea9b54d

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix: Use tla_specification instead of generated_text to adapt upstream changes

5ef835f

Merge commit '04900168e10834f3aa5eef4d13b318e1efcdac24' as 'benchmarks/sysmobench/sysmobench_core'

c5dfbb1

fix: Add gpt-4o config and fix cross-device link issue in setup_tools

a68e171

- Add gpt-4o model configuration to models.yaml
- Fix setup_tools.py to use shutil.move instead of os.rename

This resolves 'Invalid cross-device link' error when /tmp is on different filesystem

fix: Convert GenerationOutput to GenerationResult for evaluators

984336a

docs: Update README and install script for Git Subtree integration

68025ca

feat: Add docker file

25c6af8

fix: Add env.toml

97c1c3c

tareknaser and others added 21 commits January 12, 2026 12:53

courseexam: add 6.5840 spring 2025 exam 1

436242c

Signed-off-by: Tarek <tareknaser360@gmail.com>

[lab] added cmu15-213 "data lab" (#62)

5c6b412

* added cmu15-213 data lab
* docs(courselab): add note about infrastructure restrictions

Signed-off-by: Tarek <tareknaser360@gmail.com>

---------

Signed-off-by: Tarek <tareknaser360@gmail.com>
Co-authored-by: Tarek <tareknaser360@gmail.com>

correct the number of questions for cs537_fall_2021_final

db8c391

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courseexam): add support for evaluation infrastructure

6a42caa

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courseexam): deprecate points_mean metric

cacf763

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(ci): add experimental courseexam benchmark validation workflow

fd446d6

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs(ci): include a link to inspect log viewer docs

619151e

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courselab_bench): refactor infrastructure to use inspect-ai

488e8eb

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix(courselab): handle binary starter files separately

c6cf5d1

feat: update versioning for courseexam and courselab benchmarks

376b056

docs: add note on private contributions and access to the repository

ba92c3c

feat(courselab): add solution validation using a reference script

0ce3ef8

feat(courselab): add reference solutions for existing labs

5b08284

feat(courselab): add more verbose tags

1ef35f4

feat(courselab): specify git commit when cloning

fe752b8

feat(courselab): specify go binary path in task description

1fff943

docs(courselab): add a section in the README for best practices

40b1061

docs: mention the leaderboard for courseexam and courselab benchmarks

c3801f9

add lab0 task for cmu_15-445

0005bac

feat(courselab): add reference solution for cmu_15_445

23249ee

tareknaser force-pushed the port-445 branch from 412dbe8 to 23249ee Compare January 28, 2026 23:58

tareknaser approved these changes Jan 29, 2026

xuafeng mentioned this pull request Feb 2, 2026

Add CMU 15-445 #100

Open

tareknaser force-pushed the main branch from 57b962d to a1780ed Compare February 5, 2026 16:46

Jackcuii closed this Feb 9, 2026

Jackcuii mentioned this pull request Feb 9, 2026

Port CMU 15-445 #128

Open

Jackcuii wants to merge 194 commits into main from port-445


9 participants
